This is my (i.e. Baran’s) project page for the assignments of the course PHD-302 Introduction to Open Data Science at the University of Helsinki.
Here is my GitHub repository: https://github.com/bbayraktaroglu/IODS-project
date()
## [1] "Tue Nov 29 05:23:51 2022"
I have never learned R in my previous programming courses, so this was a somewhat overwhelming start for me. This is also officially my first time using GitHub; I tried to use it before for some personal projects, but failed to understand its general use. I heard about the course through the DONASCI mailing list, and a friend of mine had previously recommended that I take it. I expect to learn whatever I can about data science at an advanced level. The last time I took a rigorous statistics course (or any programming course) was during my bachelor studies.
## Thoughts about Exercise set 1 and the R for Health Data Science book
It seems that the exercise set and the first few chapters of the book provide the essentials for learning R as a programming language, with its own quirks and conventions. I found the pipe operator, i.e. “%>%”, the most out-of-the-ordinary way of feeding an input to a function; it is sometimes hard to see why it is used instead of the generic way of calling a function. Other than this, R seems to be an intuitive language, with easy-to-understand commands.
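As a note to myself, here is a minimal sketch of what the pipe does (assuming the magrittr package, which provides “%>%”): both expressions below compute the same value, the pipe just reads left to right.

library(magrittr)
# generic way: the calls are nested inside out
round(mean(c(1, 5, 8)), digits = 1)
# with the pipe: the left-hand side becomes the first argument of the next call
c(1, 5, 8) %>% mean() %>% round(digits = 1)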
R Markdown seems to be a very neat notebook format, similar to LaTeX compilers or Jupyter notebooks. I found it easy to understand, but it will take time to get used to its various syntax rules.
This week I have worked on linear regression. To be honest, the last time I studied this subject was almost 8 years ago during my bachelor studies, and although the subject is quite easy, I still find some parts fascinating. I had never worked with R before, so this week was more of a hands-on introduction to R compared to last week’s assignment. R syntax seems intuitive compared to less user-friendly languages like C or even Java.
date()
## [1] "Tue Nov 29 05:23:51 2022"
The task for data wrangling seemed daunting at first, but the individual steps were built from the ground up in the exercise set, so I did not run into any trouble.
library(tidyverse)
library(dplyr)
library(ggplot2)
library(GGally)
library(purrr)
# reading the required file for the assignment
students2014 <- read.csv("learning2014.csv", sep = ",", header = TRUE)
We now compute the dimensions of the data and look at its structure:
dim(students2014)
## [1] 166 7
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Description of the dataset:
There are 166 observations (each representing a student) and 7 variables in this dataset. The data was collected as a survey between 2014 and 2015. The variables selected for this assignment track which pedagogical learning approach each student used (deep, strategic or surface), their overall attitude towards statistics, and background information: gender, age and exam points. Here is a table with the definitions of the variables:
| Variable | Variable Type | Definition |
|---|---|---|
| gender | Character | gender of the student, M(male)/F(female) |
| age | Integer | age of the student |
| attitude | Numeric (double) | average of student’s overall attitude toward statistics, scale between 1-5 |
| deep | Numeric (double) | deep learning metric, scale between 1-5 |
| stra | Numeric (double) | strategic learning metric, scale between 1-5 |
| surf | Numeric (double) | surface learning metric, scale between 1-5 |
| points | Integer | exam points of the student |
We draw a graphical overview of the dataset:
ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
We also show summaries of the variables:
summary(students2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Comments about the outputs
The graphical overview shows the scatter plots, correlations and distributions for each pair of variables, and the summary shows the minimum, quartiles, median, mean and maximum of each variable. Female students are colored in red, male students in blue.
We see that there are considerably more females than males in the study. The females also seem to be younger than the average male, and their attitude towards statistics seems considerably lower than that of their male counterparts. There is a strong positive correlation between attitude and exam points for both genders. Interestingly, there is a strong negative correlation between attitude and surface learning for males, while no clear pattern emerges for females; the same holds for the correlation between surface and deep learning. The negative correlation means that male students who prefer surface learning tend to have a lower attitude towards statistics.
We choose the variables attitude, strategic learning and surface learning as explanatory variables, and construct a linear regression for the dependent variable “exam points”.
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Comments about the result:
One can see that the adjusted R-squared value is 0.1927, meaning that the variables attitude, strategic learning and surface learning together explain about 19.27% of the variance in exam points. Moreover, attitude is highly significant, with a p-value of about \(1.93 \times 10^{-8}\), far below even the strictest conventional threshold of 0.001. Unfortunately, the other variables are not significant, with p-values above 0.1, so it is unlikely that strategic learning and surface learning have explanatory power comparable to attitude. The model’s overall p-value is \(3.156 \times 10^{-8}\), which is very low, so the model as a whole is significant.
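As a side note, these numbers can also be extracted programmatically from the model object instead of being read off the printed summary; a small sketch:

# extract the adjusted R-squared from the model summary
summary(my_model)$adj.r.squared
# extract the p-values of the individual coefficients
summary(my_model)$coefficients[, "Pr(>|t|)"]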
We now remove the variables stra and surf, since neither is statistically significant, and fit a new model:
my_model2 <- lm(points ~ attitude, data = students2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Comments about the new result:
As expected, the new fit improved the statistical significance of the remaining explanatory variable “attitude” to about \(4.12 \times 10^{-9}\). The overall p-value is now the same as that of attitude, since this is a univariate linear regression. Compared to the previous model, every remaining term is thus highly significant. The adjusted R-squared value, however, is now 0.1856, lower than before: the model has lost some explanatory power and now explains about 18.56% of the variance in exam points. This is expected, since removing variables from a model generally decreases its explanatory power, though here only slightly. The distinction between multiple and adjusted R-squared matters little here, since there is only one explanatory variable.
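One could also compare the two nested models directly with an F-test; a quick sketch (not part of the original exercise):

# F-test of whether stra and surf jointly add explanatory power over attitude alone
anova(my_model2, my_model)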
# place the following three diagnostic plots in the same figure
par(mfrow = c(2,2))
# draw diagnostic plots for the final model
plot(my_model2, which = c(1,2,5))
Final comments about the diagnostics:
The final model seems to fit our expectations. The Q-Q plot mostly follows the line, which means the residuals are approximately normally distributed. The Residuals vs Fitted plot shows most points scattered in a horizontal strip around the line residual = 0, with no obvious pattern or outliers, so the assumption of linearity (and constant error variance) is well supported. Finally, the Residuals vs Leverage plot shows two data points (namely 56 and 35) sitting very close to the Cook’s distance contours, but not beyond them. Thus no data point exerts undue influence on the regression model, although further analysis of points 56 and 35 could be made just to be sure.
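That further analysis could start from the Cook’s distances themselves; a minimal sketch:

# five largest Cook's distances in the final model
cooks <- cooks.distance(my_model2)
head(sort(cooks, decreasing = TRUE), 5)
# a common rule of thumb flags observations with distance above 4/n
which(cooks > 4 / length(cooks))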
This week I have worked on logistic regression. Slowly but surely, I am starting to feel comfortable with R and R Markdown. I hope next week will be even easier for me. Based on the peer reviews I received last week, I have overhauled my course diary; it should look nicer now.
date()
## [1] "Tue Nov 29 05:24:08 2022"
This week, data wrangling felt considerably easier. I followed the tasks and used some help from Exercise set 3. The R code of the data wrangling part is in the data folder of my GitHub repository. I will put the link here as well: https://github.com/bbayraktaroglu/IODS-project/blob/master/data/create_alc.R
library(tidyverse)
library(tidyr)
library(dplyr)
library(ggplot2)
library(readr)
library(boot)
library(GGally)
library(purrr)
library(gmodels)
library(knitr)
library(patchwork)
library(finalfit)
library(stringr)
library(caTools)
library(caret)
# set the working directory
setwd("/Users/barancik/Github/IODS-project/data")
# reading the required file for the assignment
alc <- read.csv("alc.csv", sep = ",", header = TRUE)
We now compute the dimensions of the data and look at its structure:
dim(alc)
## [1] 370 35
str(alc)
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
Description of the dataset:
There are 370 observations (each representing a student) and 35 variables in this dataset. The data was collected as a survey, dated 27.11.2014, from two Portuguese schools. It consists of measurements regarding the students’ performance in two subjects: Mathematics and Portuguese language. The variables in this assignment keep track of background information about each student (age, sex, etc.) and measures relevant to their performance, such as the number of past class failures, the number of school absences, current health status and alcohol consumption. Some variables are binary, like sex or internet access; some are numeric; and some are nominal, like mother’s job. Some numeric variables are on a 1-5 scale, while others are unbounded. The grades (G1, G2, G3) range from 0 to 20 and represent the grades obtained in different grading periods.
The Mathematics and Portuguese language datasets were combined into a single dataset that only includes the students who took both courses; overlapping numeric variables, including the grades, were averaged. The variable ‘alc_use’ is the average of workday alcohol consumption and weekend alcohol consumption, and ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.
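For reference, the defining step in create_alc.R looks essentially like this (a sketch; the join logic lives in the script itself):

# 'alc_use' is the average of workday (Dalc) and weekend (Walc) alcohol consumption
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
# 'high_use' is TRUE when the average consumption exceeds 2
alc <- mutate(alc, high_use = alc_use > 2)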
Our main aim is to understand the relationship between alcohol consumption and other variables in the data. We choose the variables ‘failures’, ‘absences’, ‘sex’ and ‘famrel’. We hypothesize that there is a correlation between ‘high_use’ and ‘failures’ (number of past failures), ‘absences’ (number of school absences) and ‘famrel’ (quality of family relations). We also hypothesize that there is a correlation between being a male and high consumption of alcohol.
High alcohol consumption should in principle be correlated with the number of past failures: a student with a serious alcohol problem is likely to accumulate a high number of failures.
Similarly, high alcohol consumption is expected to be correlated with a high number of absences, since a frequently intoxicated student will find attending class difficult, if not impossible.
For family relations, we expect bad family relations to be correlated with high alcohol consumption, since students may try to escape troublesome relations at home, and alcohol is one such escape.
Finally, we expect high alcohol consumption from male students, while acknowledging that this could be read as a sexist expectation.
We now draw some plots regarding high alcohol usage versus the hypothesized variables above:
# put the hypothesized variables in new data frame
keep_columns <- c("high_use", "failures", "absences", "famrel", "sex")
alc_hypo <- select(alc, one_of(keep_columns))
Let’s first draw a pairs plot to summarize everything:
ggpairs(alc_hypo, mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
Now let’s start with a bar plot between high alcohol consumption and sex:
# initialize a plot of 'high_use'
g1 <- ggplot(data = alc, aes(x = high_use))
# draw a bar plot of high_use by sex
g1 + geom_bar()+facet_wrap("sex")
We can see that there may well be some correlation between being male and high alcohol consumption: the proportion of females with high alcohol use is very small compared to those without, and this ratio is clearly higher for males.
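The same comparison can be made numerically instead of eyeballing the bars; a small sketch:

# proportion of high alcohol use within each sex
alc %>% group_by(sex) %>% summarise(high_use_rate = mean(high_use))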
We now construct similar bar plots for the remaining variables, starting with past failures:
# initialize a plot of 'high_use'
g2 <- ggplot(data = alc, aes(x = high_use))
# draw a bar plot of high_use by failures
g2 + geom_bar()+facet_wrap("failures")
We see that, similarly to sex, the ratio of high to low alcohol consumption increases as the number of past failures increases. So there could be some correlation.
# initialize a plot of 'high_use'
g3 <- ggplot(data = alc, aes(x = high_use))
# draw a bar plot of high_use by absences
g3 + geom_bar()+facet_wrap("absences")
We see that, similarly to ‘sex’ and ‘failures’, the ratio of high to low alcohol consumption increases as the number of absences increases. So there could again be some correlation.
# initialize a plot of 'high_use'
g3 <- ggplot(data = alc, aes(x = high_use))
# draw a bar plot of high_use by family relations
g3 + geom_bar()+facet_wrap("famrel")
Finally, for family relations we have a similar but slightly more complicated situation. Overall, it again seems that the ratio of high to low alcohol consumption increases as family relations get worse.
We also draw a bar plot which includes all of our explanatory variables, together with the dependent variable:
# draw a bar plot of each variable
gather(alc_hypo) %>% ggplot(aes(value)) + geom_bar()+ facet_wrap("key", scales = "free")
Finally, a boxplot of family relations and absences by alcohol consumption and sex:
# initialize a plot of high_use and family relations
g1 <- ggplot(alc, aes(x = high_use, y = famrel, col = sex))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("family relations")+ggtitle("Student family relations by alcohol consumption and sex")
# initialize a plot of high_use and absences
g2<- ggplot(alc, aes(x = high_use, y = absences, col = sex))
# define the plot as a box plot and draw it
g2 + geom_boxplot() + ylab("absences") +ggtitle("Student absences by alcohol consumption and sex")
Overall observations
We see that there could be some correlation between the hypothesized explanatory variables (failures, absences, sex, family relations) and the dependent variable (high alcohol consumption). We will further analyze this.
We now move on to a more rigorous statistical test of whether our hypotheses hold. We will use logistic regression to accomplish this:
# find the model with glm()
m <- glm(high_use ~ failures + absences + sex + famrel, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex + famrel,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0786 -0.8216 -0.5746 0.9760 2.1820
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.79406 0.54281 -1.463 0.14350
## failures 0.57328 0.20531 2.792 0.00523 **
## absences 0.08941 0.02274 3.932 8.43e-05 ***
## sexM 1.04800 0.25091 4.177 2.96e-05 ***
## famrel -0.29791 0.13044 -2.284 0.02238 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 401.77 on 365 degrees of freedom
## AIC: 411.77
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) failures absences sexM famrel
## -0.79406437 0.57327802 0.08940969 1.04800182 -0.29791173
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- exp(confint(m))
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.4520039 0.1532656 1.2982302
## failures 1.7740730 1.1936470 2.6881229
## absences 1.0935286 1.0480739 1.1462668
## sexM 2.8519467 1.7556470 4.7047758
## famrel 0.7423669 0.5735490 0.9583804
We now interpret the summary. Observe that absences and sex (male) have a highly significant positive association with high_use (the coefficients are positive), with p-values below 0.001. Failures have a significant positive association, with a p-value between 0.001 and 0.01. Finally, family relations have a significant negative association (the coefficient is negative), with a p-value between 0.01 and 0.05. All of our hypotheses are therefore supported at conventional significance levels. If one demands p-values below 0.01 for greater stringency, however, family relations loses its significant association with high_use.
We now interpret the coefficients as odds ratios. Note that the exponentials of the coefficients of a logistic regression model can be interpreted as odds ratios between a unit change (vs. no change) in the corresponding explanatory variable:
We see that the odds ratio for failures is about 1.77. This means that each additional past failure multiplies the odds of a student having high alcohol consumption by about 1.77. Thus, more failures mean higher odds of high alcohol consumption, as hypothesized earlier.
Similarly, each additional absence multiplies the odds of high alcohol consumption by about 1.09. This is very close to 1, so a single absence changes the odds very little, but it is still greater than 1, in line with our hypothesis.
The odds ratio for sex (male) is about 2.85: being male rather than female multiplies the odds of high alcohol consumption by about 2.85, the largest effect among our variables. This is also in line with our hypothesis that being male is positively associated with high alcohol consumption.
Finally, the odds ratio for family relations is less than 1, meaning that the odds of high alcohol consumption decrease as family relations improve. This too is in line with our hypothesis: better family relations correspond to lower alcohol consumption.
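Note also that the per-unit odds ratios compound multiplicatively over several units, so the seemingly small effect of absences becomes substantial for a hypothetical student with many absences; a quick sketch:

# odds multiplier for one additional absence (about 1.09)
exp(coef(m)["absences"])
# odds multiplier for ten additional absences (about 1.09^10, roughly 2.4)
exp(coef(m)["absences"] * 10)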
We now compute the predictive power of the model, with failures, absences, sex and family relations as our explanatory variables and high_use as the dependent variable. We kept all of our initial explanatory variables, since in the previous section each showed a significant association with high_use.
# fit the model
m <- glm(high_use ~ failures + absences + sex + famrel, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, famrel, high_use, probability, prediction) %>% tail(10)
## failures absences sex famrel high_use probability prediction
## 361 0 3 M 4 FALSE 0.3386132 FALSE
## 362 1 0 M 4 FALSE 0.4098873 FALSE
## 363 1 7 M 5 TRUE 0.4908822 FALSE
## 364 0 1 F 5 FALSE 0.1002713 FALSE
## 365 0 6 F 4 FALSE 0.1901165 FALSE
## 366 1 2 F 5 FALSE 0.1777706 FALSE
## 367 0 2 F 4 FALSE 0.1410142 FALSE
## 368 0 3 F 1 FALSE 0.3049689 FALSE
## 369 0 4 M 2 TRUE 0.5039381 TRUE
## 370 0 2 M 4 TRUE 0.3188873 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 244 15
## TRUE 77 34
We see that our model correctly predicts 244 FALSE and 34 TRUE observations; the remaining 92 individuals (15 false positives and 77 false negatives) are misclassified. We can graph the actual values versus the predictions:
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>%addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65945946 0.04054054 0.70000000
## TRUE 0.20810811 0.09189189 0.30000000
## Sum 0.86756757 0.13243243 1.00000000
We now compute the proportion of inaccurately classified individuals (the mean prediction error):
# the logistic regression model m and dataset alc with predictions are available
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2486486
We find a value of about 0.25. This means that on average 1 out of 4 individuals is misclassified: either falsely labeled as a heavy drinker while actually being a light drinker, or vice versa.
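The same 2×2 table also lets us break the errors down further; a small sketch using the standard terms (not part of the exercise):

conf_table <- table(high_use = alc$high_use, prediction = alc$prediction)
# overall accuracy, i.e. 1 minus the training error above
sum(diag(conf_table)) / sum(conf_table)
# sensitivity: the share of true high users the model catches
conf_table["TRUE", "TRUE"] / sum(conf_table["TRUE", ])
# specificity: the share of low users correctly left unflagged
conf_table["FALSE", "FALSE"] / sum(conf_table["FALSE", ])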
We perform 10-fold cross validation:
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2459459
We obtain an error of about 0.246, essentially the same as the training error above, so the test performance is almost identical to the training performance. This is largely because family relations has a small impact on the dependent variable compared to sex or failures, so including it did not create a clearly better model and may in fact slightly worsen it.
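To test that suspicion about family relations, one could cross-validate a model without it and compare the errors; a sketch along these lines:

# the same model without famrel
m2 <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
cv2 <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
# compare the average misclassification errors of the two models
cv$delta[1]
cv2$delta[1]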
This week I have worked on clustering and classification. This week definitely felt much easier for me.
date()
## [1] "Tue Nov 29 05:24:30 2022"
This week, data wrangling felt even easier than last week. I mostly used some help from create_alc.R. The R code of the data wrangling part is in the data folder of my GitHub repository. I will put the link here as well: https://github.com/bbayraktaroglu/IODS-project/blob/master/data/create_human.R
library(MASS)
library(dplyr)
library(tidyr)
library(tidyverse)
library(corrplot)
library(ggplot2)
library(plotly)
# reading the required file for the assignment
data("Boston")
# checking out its dimension, structure and summary
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
In the Boston dataset, there are 506 observations and 14 variables. It is included in the MASS package of R. The data frame contains data related to housing values in the suburbs of Boston. Most of the variables are numeric (double), while the “chas” and “rad” variables are integers.
Let’s put our newly learned knowledge about correlation plots to good use. The following is the correlation matrix and its various plots of the Boston data:
# calculating the correlation matrix, also round it to 2 digits
cor_matrix <- cor(Boston) %>% round(digits=2)
# print the correlation matrix
print(cor_matrix)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
library(corrplot)
corrplot(cor_matrix, method="circle")
corrplot(cor_matrix, method="number")
Observe that most of the variables are more or less correlated with each other, but the “chas” variable is essentially only correlated with itself, with correlations very close to 0 with all the other variables. We know from basic probability theory that zero correlation does not imply independence, so we cannot infer that “chas” is independent of the other variables; we can only say that it is almost uncorrelated with them. “rad” and “indus” have a high overall positive correlation with most of the other variables (except “chas”): “rad” has a 0.91 correlation with “tax”, and “indus” has 0.72 with “tax”. “indus” also has a strong negative correlation of -0.71 with “dis”, while “nox” has -0.77 with “dis”.
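Instead of scanning the matrix by eye, the strongest pairwise correlations can also be listed programmatically; a small sketch (each pair appears twice, since the matrix is symmetric):

# pairs with absolute correlation above 0.7, excluding the diagonal
strong <- which(abs(cor_matrix) > 0.7 & abs(cor_matrix) < 1, arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[strong[, 1]],
           var2 = colnames(cor_matrix)[strong[, 2]],
           cor = cor_matrix[strong])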
We will scale the data by subtracting the column means from the corresponding columns and dividing the differences by the column standard deviations. This standardizes each variable to have mean 0 and standard deviation 1.
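In formula form, each value \(x\) in a column is transformed as

\[x_{scaled} = \frac{x - \bar{x}}{s_x},\]

where \(\bar{x}\) is the column mean and \(s_x\) the column standard deviation; the scale() function below implements exactly this.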
# scaling the Boston
boston_scaled <- as.data.frame(scale(Boston))
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
We now create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). We will use the quantiles as the break points in the categorical variable.
We will then drop the old crime rate variable from the dataset. Afterwards, we divide the dataset into train and test sets, so that 80% of the data belongs to the train set.
# creating a categorical variable called "crime" from scaled crime rate
boston_scaled$crim <- as.numeric(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- boston_scaled %>% dplyr::select(-crim)
# add the new categorical variable to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# creating the train set
train <- boston_scaled[ind,]
# creating the test set
test <- boston_scaled[-ind,]
We will now fit the linear discriminant analysis on the train set. We will use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. We then draw the LDA (bi)plot.
# linear discriminant analysis
lda.fit <- lda(crime ~ . , data = train)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
Observe that in the LDA biplot, “rad” has by far the longest arrow, i.e. it is the most influential linear separator, pointing towards the mostly “high” cluster.
set.seed(123)
# saving the correct classes from test data
correct_classes <-test$crime
# removing the crime variable from test data
test <- dplyr::select(test, -crime)
# predicting classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulating the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 25 12 1 0
## med_low 2 13 8 0
## med_high 0 10 7 1
## high 0 0 0 23
We find that the majority of the observations are accurately predicted: 68 are classified correctly, while the remaining 34 are not, giving a misclassification rate of about 33% (it can be as low as 23% in some other sampling with another seed). The class “high” is predicted perfectly here, while the two middle classes are the hardest to separate.
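The counts above can be read off the cross-tabulation programmatically instead of by hand; a small sketch:

tab <- table(correct = correct_classes, predicted = lda.pred$class)
# correctly classified observations sit on the diagonal
sum(diag(tab))
# misclassification rate of the LDA on the test set
1 - sum(diag(tab)) / sum(tab)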
We reload Boston, rescale it and compute the pairwise Euclidean distances between the observations.
# reload the data
data("Boston")
# scale the data again
boston_scaled <- as.data.frame(scale(Boston))
# compute the Euclidean distance of Boston
dist_eu <- dist(boston_scaled)
# summary of dist_eu
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
We now run the k-means algorithm:
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
The above plot shows a significant drop in the total within-cluster sum of squares at the value 2, after which the curve levels off. Thus, the optimal number of clusters is 2.
We now run k-means algorithm again, this time with 2 clusters, and plot the Boston dataset with the clusters. The clusters will be colored in red and black.
set.seed(123)
# k-means clustering with 2 clusters
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
If one zooms in to the plot above, one can see that “rad” has nicely separated clusters across all of its pairings; “tax” also separates the clusters well. The other variables show heavily overlapping clusters, and no further conclusions can be drawn from them.
We will now perform the k-means algorithm on the Boston data, scaled again from scratch. We choose 5 clusters. We then perform LDA using the clusters as target classes, including all the variables of the scaled Boston data in the LDA model.
set.seed(5)
# reload the data
data("Boston")
# scale the data again
boston_scaled <- as.data.frame(scale(Boston))
# k-means clustering with 5 clusters
km <- kmeans(boston_scaled, centers = 5)
# linear discriminant analysis on the clusters, with data=boston_scaled, and target variable km$cluster
lda.fit <- lda(km$cluster ~ ., data = boston_scaled)
# target classes as numeric
classes <- as.numeric(km$cluster)
# plot the lda results. Note that lda.arrows is the same function we have used above
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
We observe in the above biplot that “tax” and “rad” are the most influential linear separators for the clusters. Moreover, the k-means clustering seems to form accurate and well-separated clusters.
We will recall the code for the (scaled) train data that we used to fit the LDA. We then create a matrix product, which is a projection of the data points.
set.seed(123)
# LDA
lda.fit <- lda(crime ~ ., data = train)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
We now create a 3D plot of the columns of the matrix product:
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=~train$crime)
Now let’s run the k-means algorithm on the matrix product with 4 clusters (since the categorical variable crime has 4 classes), and draw another 3D plot where the color is defined by the k-means clusters.
set.seed(5)
km = kmeans(model_predictors, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=~factor(km$cluster))
The k-means clustering is mostly successful. One can see two superclusters; clusters 1, 2 and 4 (mostly) each form their own subcluster within one of the superclusters, while cluster 3 is split between the two. Among the “crime” classes in the previous plot, “med_high” behaves the same way, while the other classes separate nicely into the two superclusters. Thus, the k-means clustering with 4 clusters gives results similar to the LDA fit of the “crime” variable.
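This visual similarity could also be checked numerically by cross-tabulating the k-means clusters against the crime classes; a quick sketch (the cluster numbering is arbitrary, so only the pattern of the table matters):

# how the four k-means clusters line up with the four crime classes
table(cluster = km$cluster, crime = train$crime)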
(more chapters to be added similarly as we proceed with the course!)